{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Tfidf Ranking\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/tfidf_ranking.ipynb)\n",
    "\n",
    "# Table of contents \n",
    "\n",
    "1. [Introduction to the task](#1.-Introduction-to-the-task)\n",
    "\n",
    "2. [Get started with the model](#2.-Get-started-with-the-model)\n",
    "\n",
    "3. [Models list](#3.-Models-list)\n",
    "\n",
    "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n",
    "\n",
    "    4.1. [Predict using Python](#4.1-Predict-using-Python)\n",
    "    \n",
    "    4.2. [Predict using CLI](#4.2-Predict-using-CLI)\n",
    "\n",
    "5. [Customize the model](#5.-Customize-the-model)\n",
    "    \n",
    "    5.1. [Fit on Wikipedia](#5.1-Fit-on-Wikipedia)\n",
    "    \n",
    "    5.2. [Download, parse new Wikipedia dump, build database and index](#5.2-Download,-parse-new-Wikipedia-dump,-build-database-and-index)\n",
    "\n",
    "# 1. Introduction to the task\n",
    "\n",
    "This is an implementation of a passage ranker based on tf-idf vectorization.\n",
    "The ranker implementation is based on [DrQA](https://github.com/facebookresearch/DrQA/) project.\n",
    "The default ranker implementation takes a batch of queries as input and returns 100 passage titles sorted via relevance.\n",
    "\n",
    "# 2. Get started with the model\n",
    "\n",
    "First make sure you have the DeepPavlov Library installed.\n",
    "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q deeppavlov"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then make sure that all the required packages for the model are installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m deeppavlov install en_ranker_tfidf_wiki"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`en_ranker_tfidf_wiki` is the name of the model's *config_file*. [What is a Config File?](http://docs.deeppavlov.ai/en/master/intro/configuration.html)\n",
    "\n",
    "There are alternative ways to install the model's packages that do not require executing a separate command -- see the options in the next sections of this page.\n",
    "The full list of models for tfidf ranking with their config names can be found in the [table](#3.-Models-list).\n",
    "\n",
    "# 3. Models list\n",
    "\n",
    "| Config | Language | Description | RAM |\n",
    "| :--- | :---: | :--- | :---: |\n",
    "| doc_retrieval/en_ranker_tfidf_wiki.json | En | Config for TF-IDF ranking over Wikipedia | 2.9 Gb |\n",
    "| doc_retrieval/en_ranker_pop_wiki.json | En | Config for TF-IDF ranking, followed by <br> popularity ranking, over Wikipedia | 8.1 Gb |\n",
    "| doc_retrieval/ru_ranker_tfidf_wiki.json | Ru | TF-IDF ranking config over Wikipedia | 8.4 Gb |\n",
    "\n",
    "# 4. Use the model for prediction\n",
    "\n",
    "## 4.1 Predict using Python\n",
    "\n",
    "### English\n",
    "\n",
    "Building (if you don't have your own data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import build_model, configs\n",
    "\n",
    "ranker = build_model(configs.doc_retrieval.en_ranker_tfidf_wiki, download=True, install=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[18155097, 628663, 17123727, 628662, 19097375]\n"
     ]
    }
   ],
   "source": [
    "result = ranker(['Who is Ivan Pavlov?'])\n",
    "print(result[0][:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Russian"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import build_model, configs\n",
    "\n",
    "ranker = build_model(configs.doc_retrieval.ru_ranker_tfidf_wiki, download=True, install=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[4902620, 1900377, 11129584, 1720563, 1720658]\n"
     ]
    }
   ],
   "source": [
    "result = ranker(['Когда произошла Куликовская битва?'])\n",
    "print(result[0][:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Text for the output titles can be further extracted with [deeppavlov.vocabs.wiki_sqlite.WikiSQLiteVocab](https://docs.deeppavlov.ai/en/master/apiref/vocabs.html#deeppavlov.vocabs.wiki_sqlite.WikiSQLiteVocab) class.\n",
    "\n",
    "## 4.2 Predict using CLI\n",
    "\n",
    "You can also get predictions in an interactive mode through CLI (Сommand Line Interface)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov interact en_ranker_tfidf_wiki -d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Customize the model\n",
    "\n",
    "## 5.1 Fit on Wikipedia\n",
    "\n",
    "Run the following to fit the ranker on **English** Wikipedia:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "python -m deppavlov train en_ranker_tfidf_wiki"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the following to fit the ranker on **Russian** Wikipedia:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "python -m deeppavlov train ru_ranker_tfidf_wiki"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a result of ranker training, a SQLite database and tf-idf matrix are created.\n",
    "\n",
    "## 5.2 Download, parse new Wikipedia dump, build database and index\n",
    "\n",
    "**enwiki.db** SQLite database consists of ~21 M Wikipedia articles and is built by the following steps:\n",
    "\n",
    "- Download a Wikipedia dump file. We took the latest\n",
    "   [enwiki dump](https://dumps.wikimedia.org/enwiki/20230501/)\n",
    "\n",
    "- Unpack and extract the articles with [WikiExtractor](https://github.com/attardi/wikiextractor)\n",
    "   (with ``--json``, ``--no-templates``, ``--filter_disambig_pages``\n",
    "   options)\n",
    "\n",
    "- [Build a database](#5.1-Fit-on-Wikipedia).\n",
    "\n",
    "**enwiki_tfidf_matrix.npz** is a full Wikipedia tf-idf matrix of size **hash_size x number of documents** which is\n",
    "$2^{24}$ x 21 M. This matrix is built with [deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer](https://docs.deeppavlov.ai/en/master/apiref/models/vectorizers.html#deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer) class.\n",
    "\n",
    "**ruwiki.db** SQLite database consists of ~12 M Wikipedia articles and is built by the following steps:\n",
    "\n",
    "- Download a Wikipedia dump file. We took the latest [ruwiki dump](https://dumps.wikimedia.org/ruwiki/20230501/)\n",
    "\n",
    "- Unpack and extract the articles with [WikiExtractor](https://github.com/attardi/wikiextractor)\n",
    "   (with ``--json``, ``--no-templates``, ``--filter_disambig_pages``\n",
    "   options)\n",
    "\n",
    "- [Build a database](#5.1-Fit-on-Wikipedia).\n",
    "\n",
    "**ruwiki_tfidf_matrix.npz** is a full Wikipedia tf-idf matrix of size **hash_size x number of documents** which is\n",
    "$2^{24}$ x 12 M. This matrix is built with\n",
    "[deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer](https://docs.deeppavlov.ai/en/master/apiref/models/vectorizers.html#deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer) class."
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}
